In this notebook I analyze traffic collision patterns in the NYPD Motor Vehicle Collisions dataset.
loading the libraries
library(tidyverse)
library(plotly)
library(ggplot2)
reading the data
collision<-read_csv("NYPD_Motor_Vehicle_Collisions.csv")
glimpse(collision)
collision=separate(collision,DATE,c("MONTH","DAY","YEAR"),sep="/")
collision=separate(collision,TIME,c("HourOfCollision","Min","Sec"),sep=":")
collision = subset(collision, select = -c(Min,Sec) )
collision$YEAR<-as.factor(collision$YEAR)
collision$MONTH<-as.factor(collision$MONTH)
collision$DAY<-as.factor(collision$DAY)
collision$HourOfCollision<-as.factor(collision$HourOfCollision)
collision$BOROUGH<-as.factor(collision$BOROUGH)
collision$LOCATION<-as.factor(collision$LOCATION)
summary(collision)
table(collision$YEAR)/nrow(collision)
CollisionByYear = collision %>% count(YEAR)
YEARplot = CollisionByYear %>%
plot_ly(labels=~YEAR, values=~n) %>%
add_pie(hole = 0.4) %>%
layout(title = "Donut Chart for Collisions per Year", showlegend = TRUE,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
YEARplot
2017 has the highest share of collisions at 18.94%. 2018 accounts for only about 2%, but that is probably because the data covers just the first two months of 2018.
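One way to check the partial-year explanation is to count how many distinct months each year covers and compare collisions per covered month. This is a minimal sketch, assuming the `YEAR` and `MONTH` columns created above; the helper name `collisions_per_month` and the toy rows are made up for illustration and are not taken from the dataset.

```r
library(dplyr)

# Hypothetical helper: collisions per covered month, by year
collisions_per_month <- function(df) {
  df %>%
    group_by(YEAR) %>%
    summarise(Collisions = n(),
              MonthsCovered = n_distinct(MONTH),
              PerMonth = Collisions / MonthsCovered)
}

# Toy rows (not real data), only to show the shape of the result
toy <- tibble(YEAR  = c("2017", "2017", "2017", "2018", "2018"),
              MONTH = c("01", "02", "02", "01", "01"))
perMonth <- collisions_per_month(toy)
perMonth
```

On the real data this would be called as `collisions_per_month(collision)`. A year with few covered months can then be compared on `PerMonth` rather than on its raw share.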
collision$"NUMBER OF PERSONS KILLED"<-as.numeric(collision$"NUMBER OF PERSONS KILLED")
table(collision$"NUMBER OF PERSONS KILLED",collision$BOROUGH)
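# look up the deadliest row by its computed index instead of hard-coding it
deadliest = which.max(collision$"NUMBER OF PERSONS KILLED")
deadliest
collision[deadliest,]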
The deadliest collision (based on the number of persons killed) took place on 31st October 2017 in Manhattan.
This collision also took place in the evening hours.
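Note that `which.max()` returns only the first position of the maximum, so any other collision with the same death toll would be missed. The sketch below returns every row tied for the maximum; the helper name `rows_with_max` and the toy rows are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper: every row tied for the highest value of a column
rows_with_max <- function(df, col) {
  top <- max(df[[col]], na.rm = TRUE)
  df[which(df[[col]] == top), ]
}

# Toy rows (not real data)
toy <- tibble(BOROUGH = c("MANHATTAN", "QUEENS", "BROOKLYN"),
              `NUMBER OF PERSONS KILLED` = c(3, 1, 3))
deadliestRows <- rows_with_max(toy, "NUMBER OF PERSONS KILLED")
deadliestRows
```

On the real data this would be `rows_with_max(collision, "NUMBER OF PERSONS KILLED")`.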
#table(collision$LOCATION,collision$"NUMBER OF PERSONS KILLED")
casualties = collision %>%
group_by(BOROUGH) %>%
summarise(TotalNumberOfPedestriansKilled=sum(`NUMBER OF PEDESTRIANS KILLED`),
TotalNumberOfCyclistKilled=sum(`NUMBER OF CYCLIST KILLED`),
TotalNumberOfMotoristKilled=sum(`NUMBER OF MOTORIST KILLED`))
casualties
plotTwo = plot_ly(casualties,x=~BOROUGH,y=~TotalNumberOfPedestriansKilled,type = 'bar',
name='Pedestrians Killed') %>%
add_trace(y=~TotalNumberOfCyclistKilled,name='Cyclist Killed')%>%
add_trace(y=~TotalNumberOfMotoristKilled,name='Motorist Killed')%>%
layout(yaxis = list(title = 'Count'), barmode = 'stack')
plotTwo
According to the data, Brooklyn is the borough where the most collisions occur, while Staten Island sees far fewer collisions than the other boroughs.
Also, looking at the graph, cyclists appear to be the safest group on the road. However, we cannot rule out that there are simply fewer cyclists on the road, and therefore fewer collisions involving them.
It is safer to be a cyclist in Staten Island than it is to be in any other borough.
However, it is pedestrians who are the least safe in case of collisions.
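Because raw totals depend on how many collisions each borough records, a per-collision rate makes the boroughs easier to compare. The following is a rough sketch, assuming the column names used above; the helper name `deaths_per_1000` and the toy rows are hypothetical. It normalizes by recorded collisions only, since the dataset as used here says nothing about how many cyclists or pedestrians are actually on the road.

```r
library(dplyr)

# Hypothetical helper: deaths per 1,000 recorded collisions, by borough
deaths_per_1000 <- function(df) {
  df %>%
    group_by(BOROUGH) %>%
    summarise(
      Collisions = n(),
      PedestriansPer1000 = 1000 * sum(`NUMBER OF PEDESTRIANS KILLED`, na.rm = TRUE) / Collisions,
      CyclistsPer1000 = 1000 * sum(`NUMBER OF CYCLIST KILLED`, na.rm = TRUE) / Collisions,
      MotoristsPer1000 = 1000 * sum(`NUMBER OF MOTORIST KILLED`, na.rm = TRUE) / Collisions
    )
}

# Toy rows (not real data)
toy <- tibble(
  BOROUGH = c(rep("BROOKLYN", 4), rep("STATEN ISLAND", 2)),
  `NUMBER OF PEDESTRIANS KILLED` = c(1, 0, 0, 0, 1, 0),
  `NUMBER OF CYCLIST KILLED`     = c(0, 0, 0, 0, 0, 0),
  `NUMBER OF MOTORIST KILLED`    = c(0, 1, 0, 0, 0, 0)
)
rates <- deaths_per_1000(toy)
rates
```

On the real data this would be `deaths_per_1000(collision)`.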
injured = collision %>%
group_by(BOROUGH) %>%
summarise(TotalNumberOfPedestriansInjured=sum(`NUMBER OF PEDESTRIANS INJURED`),
TotalNumberOfCyclistInjured=sum(`NUMBER OF CYCLIST INJURED`),
TotalNumberOfMotoristInjured=sum(`NUMBER OF MOTORIST INJURED`))
injured
plotThree = plot_ly(injured,x=~BOROUGH,y=~TotalNumberOfPedestriansInjured,type = 'bar',
name='Pedestrians Injured') %>%
add_trace(y=~TotalNumberOfCyclistInjured,name='Cyclist Injured')%>%
add_trace(y=~TotalNumberOfMotoristInjured,name='Motorist Injured')%>%
layout(yaxis = list(title = 'Count'), barmode = 'stack')
plotThree
Motorists account for the most injuries, even though fewer of them get killed in collisions than pedestrians. Also, it is least safe to be a motorist in Brooklyn.
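The contrast between injuries and deaths can be made explicit by computing, for each road-user group, the share of casualties that were fatal. This is a sketch under the assumption that the killed and injured columns are numeric; the helper name `fatality_share` and the toy rows are invented for illustration.

```r
library(dplyr)

# Hypothetical helper: share of casualties (killed + injured) that were fatal
fatality_share <- function(df) {
  tibble(
    RoadUser = c("Pedestrian", "Cyclist", "Motorist"),
    Killed = c(sum(df$`NUMBER OF PEDESTRIANS KILLED`, na.rm = TRUE),
               sum(df$`NUMBER OF CYCLIST KILLED`, na.rm = TRUE),
               sum(df$`NUMBER OF MOTORIST KILLED`, na.rm = TRUE)),
    Injured = c(sum(df$`NUMBER OF PEDESTRIANS INJURED`, na.rm = TRUE),
                sum(df$`NUMBER OF CYCLIST INJURED`, na.rm = TRUE),
                sum(df$`NUMBER OF MOTORIST INJURED`, na.rm = TRUE))
  ) %>%
    mutate(ShareKilled = Killed / (Killed + Injured))
}

# Toy rows (not real data)
toy <- tibble(
  `NUMBER OF PEDESTRIANS KILLED`  = c(1, 0, 0),
  `NUMBER OF PEDESTRIANS INJURED` = c(1, 1, 1),
  `NUMBER OF CYCLIST KILLED`      = c(0, 0, 0),
  `NUMBER OF CYCLIST INJURED`     = c(1, 0, 1),
  `NUMBER OF MOTORIST KILLED`     = c(0, 1, 0),
  `NUMBER OF MOTORIST INJURED`    = c(3, 3, 3)
)
share <- fatality_share(toy)
share
```

On the real data this would be `fatality_share(collision)`. A group with no casualties at all would give `NaN` for `ShareKilled`.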
sort(table(collision$HourOfCollision))
We observe that the most collisions take place during the evening hours and the fewest late at night.
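To compare parts of the day directly rather than reading 24 separate counts, the hours can be grouped into periods. The sketch below uses `cut()`; the period boundaries, the labels, and the helper name `hour_period` are my own assumptions and should be adjusted to whatever definition of "evening" and "late night" is intended.

```r
# Hypothetical helper: map an hour (factor or character, "00".."23") to a period.
# The boundaries below are an assumption, not something defined by the dataset.
hour_period <- function(hour) {
  h <- as.integer(as.character(hour))
  cut(h, breaks = c(0, 6, 12, 18, 24), right = FALSE,
      labels = c("Late night", "Morning", "Afternoon", "Evening"))
}

# Toy hours (not real data)
toyHours <- factor(c("00", "03", "08", "16", "16", "17", "23"))
periodCounts <- table(hour_period(toyHours))
periodCounts
```

On the real data this would be `table(hour_period(collision$HourOfCollision))`.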
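# which.max() on a factor does not return the most frequent value.
# Assuming the goal is the location with the most collisions, tabulate first:
topLocation = names(which.max(table(collision$LOCATION)))
head(collision[which(collision$LOCATION == topLocation), ])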
table(collision$HourOfCollision,collision$BOROUGH)/nrow(collision)
Brooklyn has the highest share of all collisions for any borough and hour combination: 0.0167098228 (about 1.67%) in the 4:00 PM hour.
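Rather than scanning the table by eye, the largest cell can be located programmatically. This is a minimal sketch using base R on toy vectors (not real data); the variable names are made up. It also shows `prop.table()` with `margin = 2`, which gives each hour's share within a borough instead of its share of all collisions.

```r
# Toy vectors (not real data)
toyHour    <- c("16", "16", "16", "08", "08", "23")
toyBorough <- c("BROOKLYN", "BROOKLYN", "QUEENS", "BROOKLYN", "QUEENS", "QUEENS")

# Share of all collisions in each hour/borough cell
shares <- table(toyHour, toyBorough) / length(toyHour)

# Row and column position of the largest cell (first one if there are ties)
peak <- which(shares == max(shares), arr.ind = TRUE)
peakHour    <- rownames(shares)[peak[1, 1]]
peakBorough <- colnames(shares)[peak[1, 2]]

# Share of each hour within its borough (columns sum to 1)
withinBorough <- prop.table(table(toyHour, toyBorough), margin = 2)
```

On the real data, `toyHour` and `toyBorough` would be replaced by `collision$HourOfCollision` and `collision$BOROUGH`.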
tapply(collision$"NUMBER OF PEDESTRIANS KILLED",collision$BOROUGH,mean)
BoroughVCollision = collision %>%
count(HourOfCollision,BOROUGH)
plot_ly(BoroughVCollision, x=~BOROUGH,y=~HourOfCollision,z=~n,
colors = colorRamp(c("green", "red")),type="heatmap")